Lessons from Building Acoustic Models with a Million Hours of Speech
This is a report of our lessons learned building acoustic models from 1
million hours of unlabeled speech, while labeled speech is restricted to 7,000
hours. We employ student/teacher training on the unlabeled data, which helps scale
out target generation compared to confidence-model-based methods, which
require a decoder and a confidence model. To optimize storage and to
parallelize target generation, we store only the high-valued logits from the teacher
model. Introducing the notion of scheduled learning, we interleave learning on
unlabeled and labeled data. To scale distributed training across a large number
of GPUs, we use BMUF with 64 GPUs, while performing sequence training only on
labeled data with gradient threshold compression SGD using 16 GPUs. Our
experiments show that extremely large amounts of data are indeed useful; with
little hyper-parameter tuning, we obtain relative WER improvements in the 10 to
20% range, with higher gains in noisier conditions.
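The storage trick described above, keeping only the teacher's high-valued logits per frame, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the choice of k are hypothetical, and real systems apply this over output layers with thousands of senones rather than ten.

```python
import numpy as np

def topk_soft_targets(logits, k=20):
    """Keep only the k highest-valued teacher logits per frame.

    Storing (index, value) pairs instead of the full posterior vector
    cuts storage dramatically when the output layer is large, and lets
    target generation be parallelized without a decoder.
    """
    idx = np.argsort(logits, axis=-1)[..., -k:]       # top-k class indices per frame
    vals = np.take_along_axis(logits, idx, axis=-1)   # the corresponding logits
    return idx, vals

def soft_targets(vals):
    """Renormalize the retained logits into a probability distribution
    over the kept classes: soft targets for the student model."""
    e = np.exp(vals - vals.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# one frame with a 10-dimensional output layer (a toy stand-in for thousands of senones)
frame = np.array([[0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 1.5, 0.2, -0.5]])
idx, vals = topk_soft_targets(frame, k=3)
probs = soft_targets(vals)
```

The student is then trained to match `probs` on the stored indices only, so each unlabeled utterance costs a single teacher forward pass plus a small amount of storage.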
Max-Pooling Loss Training of Long Short-Term Memory Networks for Small-Footprint Keyword Spotting
We propose a max-pooling based loss function for training Long Short-Term
Memory (LSTM) networks for small-footprint keyword spotting (KWS), with low
CPU, memory, and latency requirements. The max-pooling loss training can be
further guided by initializing with a cross-entropy loss trained network. A
posterior smoothing based evaluation approach is employed to measure keyword
spotting performance. Our experimental results show that LSTM models trained
using cross-entropy loss or max-pooling loss outperform a cross-entropy loss
trained baseline feed-forward Deep Neural Network (DNN). In addition,
the max-pooling loss trained LSTM with a randomly initialized network performs
better than the cross-entropy loss trained LSTM. Finally, the max-pooling loss
trained LSTM initialized with a cross-entropy pre-trained network shows the
best performance, yielding a relative reduction in the Area Under the Curve
(AUC) measure compared to the baseline feed-forward DNN.
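The core idea of the max-pooling loss can be sketched with NumPy. This is a hedged illustration of the loss computation only, not the paper's implementation: for a keyword utterance, cross-entropy is applied only at the frame where the keyword posterior peaks; for a background utterance, it is averaged over all frames. The function name and the two-class layout are assumptions for the sake of the example.

```python
import numpy as np

def max_pooling_loss(frame_logits, is_keyword, keyword_class=1):
    """Max-pooling loss sketch for keyword spotting.

    frame_logits: (frames, classes) raw network outputs for one utterance.
    For a positive utterance, only the best-scoring frame for the keyword
    contributes to the loss; for a negative utterance, every frame should
    predict the background class (class 0 here).
    """
    # frame-level posteriors via softmax
    e = np.exp(frame_logits - frame_logits.max(axis=-1, keepdims=True))
    post = e / e.sum(axis=-1, keepdims=True)
    if is_keyword:
        t = int(np.argmax(post[:, keyword_class]))    # max-pooled frame
        return float(-np.log(post[t, keyword_class]))
    return float(np.mean(-np.log(post[:, 0])))

# toy example: 3 frames, 2 classes (0 = background, 1 = keyword)
logits = np.array([[2.0, 0.0], [1.0, 0.5], [0.0, 3.0]])
pos_loss = max_pooling_loss(logits, is_keyword=True)
neg_loss = max_pooling_loss(logits, is_keyword=False)
```

Because only one frame per positive utterance back-propagates, the network is free to place its keyword evidence wherever it is strongest, which is also why initializing from a cross-entropy pre-trained network helps.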
Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations
Load imbalance pervasively exists in distributed deep learning training
systems, either caused by the inherent imbalance in learned tasks or by the
system itself. Traditional synchronous Stochastic Gradient Descent (SGD)
achieves good accuracy for a wide variety of tasks, but relies on global
synchronization to accumulate the gradients at every training step. In this
paper, we propose eager-SGD, which relaxes the global synchronization for
decentralized accumulation. To implement eager-SGD, we propose to use two
partial collectives: solo and majority. With solo allreduce, the faster
processes contribute their gradients eagerly without waiting for the slower
processes, whereas with majority allreduce, at least half of the participants
must contribute gradients before continuing, all without using a central
parameter server. We theoretically prove the convergence of the algorithms and
describe the partial collectives in detail. Experimental results on
load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show
that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous
SGD, without losing accuracy.

Comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61, 2020.
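The majority partial collective can be sketched as follows. This is a single-process simulation, not the paper's MPI-based implementation: the function name, the rescaling choice, and the handling of stragglers (folded into a later step rather than modeled here) are assumptions made for illustration.

```python
import numpy as np

def majority_allreduce(ready_grads, n_procs):
    """Sketch of a majority partial collective: reduce as soon as at least
    half of the processes have contributed their gradients, instead of
    waiting for all of them as synchronous SGD would.

    ready_grads: list of gradient arrays from the processes that arrived.
    """
    assert 2 * len(ready_grads) >= n_procs, "need a majority before reducing"
    total = np.sum(ready_grads, axis=0)
    # rescale so the expected update magnitude matches a full allreduce
    return total * (n_procs / len(ready_grads))

# 3 of 4 workers are ready; the slow fourth does not block the step
update = majority_allreduce(
    [np.array([1.0]), np.array([2.0]), np.array([3.0])], n_procs=4)
```

A solo allreduce is the same idea with the threshold lowered to one contributor: the fastest process triggers the reduction and everyone else's gradients are accumulated asynchronously.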
SparCML: High-Performance Sparse Communication for Machine Learning
Applying machine learning techniques to the quickly growing data in science
and industry requires highly-scalable algorithms. Large datasets are most
commonly processed in a "data-parallel" fashion, distributed across many nodes. Each
node's contribution to the overall gradient is summed using a global allreduce.
This allreduce is the single communication step, and thus the scalability
bottleneck, for most machine learning workloads. We observe that, frequently, many
gradient values are (close to) zero, leading to sparse or sparsifiable
communications. To
exploit this insight, we analyze, design, and implement a set of
communication-efficient protocols for sparse input data, in conjunction with
efficient machine learning algorithms which can leverage these primitives. Our
communication protocols generalize standard collective operations, by allowing
processes to contribute arbitrary sparse input data vectors. Our generic
communication library, SparCML, extends MPI to support additional features,
such as non-blocking (asynchronous) operations and low-precision data
representations. As such, SparCML and its techniques will form the basis of
future highly-scalable machine learning frameworks.
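One step of a sparse reduction in the spirit described above can be sketched like this. The function names, the index-value layout, and the density threshold at which the format switches to dense are all assumptions for illustration; SparCML itself implements these as MPI-style collectives with additional features such as non-blocking operation and low-precision values.

```python
import numpy as np

def to_sparse(v, eps=1e-8):
    """(index, value) form of a mostly-zero gradient vector."""
    idx = np.flatnonzero(np.abs(v) > eps)
    return idx, v[idx]

def sparse_sum(a, b, dim, density_switch=0.25):
    """Sum two sparse contributions, as one step of a sparse allreduce.

    Falls back to a dense result when the union of nonzeros grows past a
    density threshold, since the index-value format stops paying off as
    vectors fill in during a reduction.
    """
    (ia, va), (ib, vb) = a, b
    out = np.zeros(dim)
    out[ia] += va
    out[ib] += vb
    nz = np.flatnonzero(out)
    if len(nz) > density_switch * dim:
        return "dense", out
    return "sparse", (nz, out[nz])

kind, result = sparse_sum(
    to_sparse(np.array([0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0])),
    to_sparse(np.array([0, 2.0, 0, 3.0, 0, 0, 0, 0, 0, 0])),
    dim=10)
```

The dense fallback matters because, even if every process starts with a very sparse gradient, the union of nonzero indices grows at each level of the reduction tree.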
Generation and Minimization of Word Graphs in Continuous Speech Recognition
in the forward and backward passes, it is guaranteed that the generated graph is the minimal word-lattice, containing exactly the paths that have a higher score than the threshold. This includes all the different alignments of each word string. For re-scoring using new acoustic models, the word-lattice constructed above is the optimal minimal representation, because it is desirable to re-score the different alignments. However, for re-scoring using only a new grammar, a more compact representation is better. Minimizing a word-lattice is equivalent to minimizing a nondeterministic finite-state automaton (NFA), which is a hard problem that cannot in general be solved in polynomial time. Therefore, the problem has been attacked using heuristic methods that reduce the graph, but not to the minimal size. In particular, the so-called word-pair approximation has been applied [6]. In this study we instead approached the problem by applying the classical algorithms for: 1) constructing an equi
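The forward/backward criterion described above, keeping exactly the paths above a score threshold, can be sketched as lattice pruning on a small DAG. This is a generic illustration under stated assumptions (log-score arcs, nodes already in topological order, `max` as the path combiner), not the paper's algorithm.

```python
def prune_lattice(arcs, n_nodes, threshold):
    """Forward/backward lattice pruning sketch: keep an arc only if the
    best complete path through it scores at least `threshold`.

    arcs: list of (src, dst, score) with nodes in topological order;
    node 0 is the start, node n_nodes - 1 the end. Scores are
    log-probabilities, so a path's score is the sum of its arc scores.
    """
    NEG = float("-inf")
    fwd = [NEG] * n_nodes
    bwd = [NEG] * n_nodes
    fwd[0] = 0.0
    for s, d, w in arcs:                 # best score reaching each node
        fwd[d] = max(fwd[d], fwd[s] + w)
    bwd[n_nodes - 1] = 0.0
    for s, d, w in reversed(arcs):       # best score from each node to the end
        bwd[s] = max(bwd[s], w + bwd[d])
    return [(s, d, w) for s, d, w in arcs
            if fwd[s] + w + bwd[d] >= threshold]

# two competing paths: 0 -> 1 -> 3 (score -2) and 0 -> 2 -> 3 (score -6)
arcs = [(0, 1, -1.0), (0, 2, -5.0), (1, 3, -1.0), (2, 3, -1.0)]
kept = prune_lattice(arcs, n_nodes=4, threshold=-3.0)
```

Every arc the function keeps lies on at least one complete path above the threshold, which is exactly the property the abstract attributes to the generated word-lattice.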
Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition
The use of spatial information with multiple microphones can improve
far-field automatic speech recognition (ASR) accuracy. However, conventional
microphone array techniques degrade speech enhancement performance when there
is an array geometry mismatch between design and test conditions. Moreover,
such speech enhancement techniques do not always yield ASR accuracy improvement
due to the difference between speech enhancement and ASR optimization
objectives. In this work, we propose to unify an acoustic model framework by
optimizing spatial filtering and long short-term memory (LSTM) layers from
multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple
types of array geometry. In contrast to deep clustering methods that treat a
neural network as a black box tool, the network encoding the spatial filters
can process streaming audio data in real time without the accumulation of
target signal statistics. We demonstrate the effectiveness of such MC neural
networks through ASR experiments on the real-world far-field data. We show that
our two-channel acoustic model can on average reduce word error rates (WERs)
by 13.4% and 12.7% compared to a single-channel ASR system with the log-mel
filter bank energy (LFBE) feature under the matched and mismatched microphone
placement conditions, respectively. Our result also shows that our two-channel
network achieves a relative WER reduction of over 7.0% compared to conventional
beamforming with seven microphones overall.

Comment: ICASSP 2019, 5 pages. arXiv admin note: substantial text overlap with
arXiv:1903.0529
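The spatial filtering idea, combining the channels of a multi-channel STFT with learned complex weights per frequency bin, can be sketched as a single layer. This is a minimal NumPy illustration with hypothetical names and shapes, not the paper's network; the "looks" dimension is one way to cover several steering directions or array geometries at once, as the abstract suggests.

```python
import numpy as np

def spatial_filter_layer(stft, weights):
    """Spatial-filtering layer sketch: per frequency bin, combine the
    channels with complex weights (a filter-and-sum beamformer that a
    network could train jointly with the LSTM layers above it).

    stft:    (channels, frames, freq_bins) complex STFT of the array input
    weights: (looks, channels, freq_bins) complex filter coefficients
    returns: (looks, frames, freq_bins) filtered outputs
    """
    # sum over channels for every look, frame, and frequency bin
    return np.einsum('lcf,ctf->ltf', weights, stft)

# toy input: 2 channels, 1 frame, 1 frequency bin
x = np.array([[[1.0 + 0.0j]], [[0.0 + 1.0j]]])    # (2, 1, 1)
w = np.array([[[1.0 + 0.0j], [0.0 - 1.0j]]])      # (1, 2, 1): one look
y = spatial_filter_layer(x, w)
```

Here the weight on the second channel rotates its 90-degree phase lead back into alignment, so the two channels add coherently, which is what phase-aligning weights do for a source arriving off-axis.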
Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition
Conventional far-field automatic speech recognition (ASR) systems typically
employ microphone array techniques for speech enhancement in order to improve
robustness against noise or reverberation. However, such speech enhancement
techniques do not always yield ASR accuracy improvement because the
optimization criterion for speech enhancement is not directly relevant to the
ASR objective. In this work, we develop new acoustic modeling techniques that
optimize spatial filtering and long short-term memory (LSTM) layers from
multi-channel (MC) input based on an ASR criterion directly. In contrast to
conventional methods, we incorporate array processing knowledge into the
acoustic model. Moreover, we initialize the network with beamformers'
coefficients. We investigate effects of such MC neural networks through ASR
experiments on the real-world far-field data where users are interacting with
an ASR system in uncontrolled acoustic environments. We show that our MC
acoustic model can reduce the word error rate (WER) by 16.5% compared to a
single-channel ASR system with the traditional log-mel filter bank energy
(LFBE) feature on average. Our result also shows that our network with the
spatial filtering layer on two-channel input achieves a relative WER reduction
of 9.5% compared to conventional beamforming with seven microphones.

Comment: ICASSP 2019, 5 pages
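The initialization-from-beamformer-coefficients idea mentioned above can be sketched with classical delay-and-sum weights. This is a hedged illustration, assuming a simple free-field steering model; the function name and shapes are hypothetical, and the paper's actual initialization scheme may differ.

```python
import numpy as np

def delay_and_sum_init(delays, freqs):
    """Delay-and-sum beamformer coefficients, usable as a non-random
    starting point for a trainable spatial-filtering layer.

    delays: per-channel steering delays in seconds, shape (channels,)
    freqs:  FFT bin center frequencies in Hz, shape (freq_bins,)
    returns complex weights of shape (channels, freq_bins)
    """
    delays = np.asarray(delays, dtype=float)[:, None]   # (channels, 1)
    freqs = np.asarray(freqs, dtype=float)[None, :]     # (1, freq_bins)
    n_channels = delays.shape[0]
    # phase-align each channel toward the look direction, then average
    return np.exp(-2j * np.pi * freqs * delays) / n_channels

# broadside look (zero delays) over two frequency bins
w = delay_and_sum_init([0.0, 0.0], [100.0, 200.0])
```

Starting from such coefficients gives the network a sensible beam pattern on day one, after which training can move the weights toward whatever spatial filter best serves the ASR criterion.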